The Hidden TAG Model: Synchronous Grammars for Parsing Resource-Poor Languages
Abstract
This paper discusses a novel probabilistic synchronous TAG formalism, synchronous Tree Substitution Grammar with sister adjunction (TSG+SA). We use it to parse a language for which there is no training data by leveraging a second, related language for which there is abundant training data. The grammar for the resource-rich side is automatically extracted from a treebank; the grammar on the resource-poor side and the synchronization are created by handwritten rules. Our approach thus represents a combination of grammar-based and empirical natural language processing. We discuss the approach using the example of Levantine Arabic and Standard Arabic.
Footnote: This work was primarily carried out while the first author was at the University of Maryland Institute for Advanced Computer Studies.
1 Parsing Arabic Dialects and Tree Adjoining Grammar
The Arabic language is a collection of spoken dialects and a standard written language. The standard written language, Modern Standard Arabic (MSA), is the same throughout the Arab world and is also used in some scripted spoken communication (newscasts, parliamentary debates). It is based on Classical Arabic and is not a native language of any Arabic-speaking people, i.e., children do not learn it from their parents but in school. Thus most native speakers of Arabic are unable to produce sustained spontaneous MSA. The dialects show phonological, morphological, lexical, and syntactic differences comparable to those among the Romance languages. They vary not only along a geographical continuum but also with other sociolinguistic variables such as the urban/rural/Bedouin dimension.
The multidialectal situation has important negative consequences for Arabic natural language processing (NLP): since the spoken dialects are not officially written and do not have a standard orthography, it is very costly to obtain adequate corpora, even unannotated corpora, to use for training NLP tools such as parsers.
Furthermore, there are almost no parallel corpora involving one dialect and MSA. The question thus arises of how to create a statistical parser for an Arabic dialect, given that statistical parsers are typically trained on large corpora of parse trees. We present one solution to this problem, based on the assumption that it is easier to manually create new resources that relate a dialect to MSA (lexicon and grammar) than it is to manually create syntactically annotated corpora in the dialect. In this paper, we deal with Levantine Arabic (LA). Our approach does not assume the existence of any annotated LA corpus (except for development and testing), nor of a parallel LA-MSA corpus.
The approach described in this paper uses a special parameterization of stochastic synchronous TAG (Shieber, 1994), which we call a “hidden TAG model.” This model couples a model of MSA trees, learned from the Arabic Treebank, with a model of MSA-LA translation, which is initialized by hand and then trained in an unsupervised fashion. Parsing new LA sentences then entails simultaneously building a forest of MSA trees and the corresponding forest of LA trees. Our implementation uses an extension of our monolingual parser (Chiang, 2000) based on tree-substitution grammar with sister adjunction (TSG+SA).
The main contributions of this paper are as follows:
1. We introduce the novel concept of a hidden TAG model.
2. We use this model to combine statistical approaches with grammar engineering (specifically motivated by the linguistic facts). Our approach thus exemplifies the specific strength of a grammar-based approach.
3. We present an implementation of stochastic synchronous TAG that incorporates various facilities useful for training on real-world data: sister adjunction (needed for generating the flat structures found in most treebanks), smoothing, and Inside-Outside reestimation.
This paper is structured as follows.
We first briefly discuss related work (Section 2) and some of the linguistic facts that motivate this work (Section 3). We then present the formalism, probabilistic model, and parsing algorithm (Section 4). Finally, we discuss the manual grammar engineering (Section 5) and evaluation (Section 6).
2 Related Work
This paper is part of a larger investigation into parsing Arabic dialects (Rambow et al., 2005; Chiang et al., 2006). In that investigation, we examined three different approaches:
• Sentence transduction, in which a dialect sentence is roughly translated into one or more MSA sentences and then parsed by an MSA parser.
• Treebank transduction, in which the MSA treebank is transduced into an approximation of an LA treebank, on which an LA parser is then trained.
• Grammar transduction, which is the name given in the overview papers to the approach discussed in this paper. The present paper provides for the first time a complete technical presentation of this approach.
Overall, grammar transduction outperformed the other two approaches.
In other work, there has been a fair amount of interest in parsing one language using another language; see, for example, Smith and Smith (2004) and Hwa et al. (2004). Much of this work, like ours, relies on synchronous grammars (CFGs). However, these approaches rely on parallel corpora. For MSA and its dialects, there are no naturally occurring parallel corpora. It is this fact that has led us to investigate the use of explicit linguistic knowledge to complement machine learning.
3 Linguistic Facts
We illustrate the differences between LA and MSA using an example:
(1) a. (LA) AlrjAl byHbw $ Al$gl hdA
        the-men like not the-work this
        ‘the men do not like this work’
    b. (MSA) lA yHb AlrjAl h*A AlEml
        not like the-men this the-work
        ‘the men do not like this work’
Lexically, we observe that the word for ‘work’ is Al$gl in LA but AlEml in MSA. In contrast, the word for ‘men’ is the same in both LA and MSA: AlrjAl.
There are typically also differences in function words, in our example $ (LA) and lA (MSA) for ‘not’.
Morphologically, we see that LA byHbw has the same stem as MSA yHb, but with two additional morphemes: the present aspect marker b-, which does not exist in MSA, and the agreement marker -w, which is used in MSA only in subject-initial sentences, while in LA it is always used.
Syntactically, we observe three differences. First, the subject precedes the verb in LA (SVO order), but follows it in MSA (VSO order). This is in fact not a strict requirement, but a strong preference: both varieties allow both orders, but in the dialects the SVO order is more common, while in MSA the VSO order is more common. Second, we see that the demonstrative determiner follows the noun in LA, but precedes it in MSA. Finally, we see that the negation marker follows the verb in LA, while it precedes the verb in MSA. (Levantine also has other negation markers that precede the verb, as well as the circumfix m-$.)
The two phrase structure trees are shown in Figure 1 in the convention of the Linguistic Data Consortium (Maamouri et al., 2004). Unlike the phrase
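As a rough illustration of the lexical and word-order differences in example (1), the following Python sketch pairs each MSA constituent order with an LA constituent order and reads both sentences off a single shared derivation. It is only a toy under stated assumptions: the rule inventory (`SYNC_RULES`), the slot names (`NEG`, `V`, `SUBJ`, `OBJ`, `DEM`, `N`), and the whole-word lexicon are hypothetical and are not the paper's TSG+SA grammar or probability model. It also hard-codes the preferred orders, although the text notes that both varieties allow both orders.

```python
# Toy sketch (hypothetical rules, not the paper's grammar): one derivation,
# two linearizations, using the transliterated words from example (1).

# Whole-word MSA -> LA lexicon. The verb entry glosses over the morphology
# (b- aspect marker, -w agreement marker) described in the text.
LEXICON = {
    "lA": "$",        # 'not'
    "yHb": "byHbw",   # 'like'
    "AlrjAl": "AlrjAl",  # 'the-men' (same in both varieties)
    "h*A": "hdA",     # 'this'
    "AlEml": "Al$gl",  # 'the-work'
}

# Each rule links the same slots on both sides but orders them differently:
# label -> (MSA slot order, LA slot order).
SYNC_RULES = {
    # MSA: negation-verb-subject-object; LA: subject-verb-negation-object
    "S": (["NEG", "V", "SUBJ", "OBJ"], ["SUBJ", "V", "NEG", "OBJ"]),
    # MSA: demonstrative before noun; LA: demonstrative after noun
    "NP": (["DEM", "N"], ["N", "DEM"]),
}

# A derivation node is either an MSA word (leaf) or (label, {slot: node}).
DERIVATION = (
    "S",
    {
        "NEG": "lA",
        "V": "yHb",
        "SUBJ": "AlrjAl",
        "OBJ": ("NP", {"DEM": "h*A", "N": "AlEml"}),
    },
)


def realize(node, side):
    """Return the word sequence for side 'msa' or 'la' of a derivation."""
    if isinstance(node, str):
        # Leaves are stored as MSA words; map them through the lexicon for LA.
        return [node] if side == "msa" else [LEXICON[node]]
    label, children = node
    msa_order, la_order = SYNC_RULES[label]
    order = msa_order if side == "msa" else la_order
    words = []
    for slot in order:
        words.extend(realize(children[slot], side))
    return words


if __name__ == "__main__":
    print(" ".join(realize(DERIVATION, "msa")))
    print(" ".join(realize(DERIVATION, "la")))
```

Because both sides are read off the same derivation, a structure proposed for the MSA side also fixes a structure for the LA side, which is the intuition behind building the two forests simultaneously. A real synchronous grammar would attach probabilities to the rule pairs and allow alternative orders.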
Related resources
Hyperedge Replacement and Nonprojective Dependency Structures
Synchronous Hyperedge Replacement Graph Grammars (SHRG) can be used to translate between strings and graphs. In this paper, we study the capacity of these grammars to create non-projective dependency graphs. As an example, we use languages that contain cross-serial dependencies. Lexicalized hyperedge replacement grammars can derive string languages (as path graphs) that contain an arbitrary num...
Synchronous Context-Free Tree Grammars
We consider pairs of context-free tree grammars combined through synchronous rewriting. The resulting formalism is at least as powerful as synchronous tree adjoining grammars and linear, nondeleting macro tree transducers, while the parsing complexity remains polynomial. Its power is subsumed by context-free hypergraph grammars. The new formalism has an alternative characterization in terms of ...
Relating Tabular Parsing Algorithms for LIG and TAG
Tree Adjoining Grammars (TAG) and Linear Indexed Grammars (LIG) are extensions of Context Free Grammars that generate the class of Tree Adjoining Languages. Taking advantage of this property, and providing a method for translating a TAG into a LIG, we define several parsing algorithms for TAG on the basis of their equivalent LIG parsers. We also explore why some practical optimizations for TAG ...
Synchronous Dependency Insertion Grammars: A Grammar Formalism For Syntax Based Statistical MT
This paper introduces a grammar formalism specifically designed for syntax-based statistical machine translation. The synchronous grammar formalism we propose in this paper takes into consideration the pervasive structure divergence between languages, which many other synchronous grammars are unable to model. A Dependency Insertion Grammar (DIG) is a generative grammar formalism that captures ...
TuLiPA - Parsing Extensions of TAG with Range Concatenation Grammars
In this paper we present a parsing framework for extensions of Tree Adjoining Grammars (TAG) called TuLiPA (Tübingen Linguistic Parsing Architecture). In particular, besides TAG, the parser can process Tree-Tuple MCTAG with shared nodes (TT-MCTAG), a TAG-extension that has been proposed to deal with scrambling in free word order languages such as German. The central strategy of the parser is su...